Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies
نویسنده
چکیده
Various attempts have been made to predict the individual disease risk based on genotype data from genome-wide association studies (GWAS). However, most studies only investigated one or two classification algorithms and feature encoding schemes. In this study, we applied seven different classification algorithms on GWAS case-control data sets for seven different diseases to create models for disease risk prediction. Further, we used three different encoding schemes for the genotypes of single nucleotide polymorphisms (SNPs) and investigated their influence on the predictive performance of these models. Our study suggests that an additive encoding of the SNP data should be the preferred encoding scheme, as it proved to yield the best predictive performances for all algorithms and data sets. Furthermore, our results showed that the differences between most state-of-the-art classification algorithms are not statistically significant. Consequently, we recommend to prefer algorithms with simple models like the linear support vector machine (SVM) as they allow for better subsequent interpretation without significant loss of accuracy.
منابع مشابه
Genome Wide Association Studies, Next Generation Sequencing and Their Application in Animal Breeding and Genetics: A Review
Recently genetic studies have been revolutionized by next generation sequencing (NGS) technology, and it is expected that the use of this technology will largely eliminate defects in the methods of association studies. The NGS technology is becoming the premier tool in genetics. However, at the moment the use of this method is limited especially in the livestock due to high cost and computation...
متن کاملEvaluation of Classifiers in Software Fault-Proneness Prediction
Reliability of software counts on its fault-prone modules. This means that the less software consists of fault-prone units the more we may trust it. Therefore, if we are able to predict the number of fault-prone modules of software, it will be possible to judge the software reliability. In predicting software fault-prone modules, one of the contributing features is software metric by which one ...
متن کاملHeritability for Stroke: Essential for Taking Family History
There are many well-established factors that influence the risk of stroke including blood pressure, diabetes, low socioeconomic status and smoking, however, the shared genetic resource in members of a family effect on stroke predisposition. Genome-wide association studies (GWAS) have demonstrated evidence of a shared genetic source in stroke risk. This review considered the influence of family...
متن کاملGenome-wide Association Study to Identify Genes and Biological Pathways Associated with Type Traits in Cattle using Pathway Analysis
Extended Abstract Introduction and Objective: Type traits describing the skeletal characteristics of an animal are moderately to strongly genetically correlate with other economically important traits in cattle including fertility, longevity and carcass traits. The present study aimed to conduct a genome wide association studies (GWAS) based on gene-set enrichment analysis for identifying the ...
متن کاملP83: Role of Neuregulin 3 Genes Expression on Attention Deficits in Schizophrenia
Genetic epidemiological studies strongly suggest that additive and interactive genes, each with small effects, mediate the genetic vulnerability for schizophrenia. With the human genome working draft at hand, candidate gene (and ultimately large-scale genome-wide) association studies are gaining renewed interest in the effort to unravel the complex genetics of schizophrenia. Linkage and fine ma...
متن کامل